52 research outputs found

    Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models

    The recent performance leap of Large Language Models (LLMs) opens up new opportunities across numerous industrial applications and domains. However, erroneous generations made by LLMs, such as false predictions, misinformation, and hallucination, have also raised serious concerns about the trustworthiness of LLMs, especially in safety-, security-, and reliability-sensitive scenarios, potentially hindering real-world adoption. While uncertainty estimation has shown its potential for interpreting the prediction risks of general machine learning (ML) models, little is known about whether and to what extent it can help explore an LLM's capabilities and counteract its undesired behavior. To bridge the gap, in this paper, we initiate an exploratory study on the risk assessment of LLMs through the lens of uncertainty. In particular, we experiment with twelve uncertainty estimation methods and four LLMs on four prominent natural language processing (NLP) tasks to investigate to what extent uncertainty estimation techniques can help characterize the prediction risks of LLMs. Our findings validate the effectiveness of uncertainty estimation for revealing LLMs' uncertain/non-factual predictions. Beyond general NLP tasks, we also conduct extensive experiments with four LLMs on code generation using two datasets. We find that uncertainty estimation can potentially uncover buggy programs generated by LLMs. Insights from our study shed light on the future design and development of reliable LLMs, facilitating further research toward enhancing the trustworthiness of LLMs. Comment: 20 pages, 4 figures
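
    To make "uncertainty estimation" concrete, here is a minimal sketch of two common sample-free scores computed from an LLM's per-step next-token distributions: mean token entropy and mean negative log-probability of the generated tokens. These are illustrative assumptions, not necessarily among the twelve methods the study evaluates; `is_uncertain` and its threshold are hypothetical, and obtaining the distributions from a real model is out of scope here.

```python
import math
from typing import Sequence


def token_entropy(dist: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def mean_token_entropy(dists: Sequence[Sequence[float]]) -> float:
    """Average per-step entropy over a generated sequence."""
    return sum(token_entropy(d) for d in dists) / len(dists)


def mean_neg_log_prob(chosen: Sequence[float]) -> float:
    """Average negative log-probability of the tokens actually generated."""
    return -sum(math.log(p) for p in chosen) / len(chosen)


def is_uncertain(dists: Sequence[Sequence[float]], threshold: float = 1.0) -> bool:
    """Hypothetical flag: treat a generation as risky above an entropy threshold."""
    return mean_token_entropy(dists) > threshold


# Usage: a peaked (confident) generation versus a flat (hesitant) one.
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
hesitant = [[0.25, 0.25, 0.25, 0.25]] * 3
print(is_uncertain(confident), is_uncertain(hesitant))
```

    Higher scores indicate less confident generations; in practice a threshold would have to be calibrated per task rather than fixed at 1.0.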

    LUNA: A Model-Based Universal Analysis Framework for Large Language Models

    Over the past decade, Artificial Intelligence (AI) has achieved great success and is being used in a wide range of academic and industrial fields. More recently, large language models (LLMs) have advanced rapidly, propelling AI to a new level and enabling even more diverse applications across industrial domains, particularly in areas like software engineering and natural language processing. Nevertheless, a number of emerging trustworthiness concerns exhibited by LLMs have recently received much attention; unless these are properly addressed, the widespread adoption of LLMs could be greatly hindered in practice. The distinctive characteristics of LLMs, such as the self-attention mechanism, extremely large model scale, and autoregressive generation schema, differ from those of classic AI software based on CNNs and RNNs and present new challenges for quality analysis. Universal and systematic analysis techniques for LLMs are still lacking despite the urgent industrial demand. To bridge this gap, we initiate an early exploratory study and propose LUNA, a universal analysis framework for LLMs designed to be general and extensible, which enables versatile analysis of LLMs from multiple quality perspectives in a human-interpretable manner. In particular, we first leverage data from the desired trustworthiness perspectives to construct an abstract model as an auxiliary analysis asset, supported by various abstract model construction methods. To assess the quality of the abstract model, we collect and define a number of evaluation metrics at both the abstract model level and the semantics level. Then, the semantics, i.e., the degree to which the LLM satisfies the trustworthiness perspective, is bound to the abstract model, enriching it and enabling more detailed analysis applications for diverse purposes. Comment: 44 pages, 9 figures
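
    The abstract-model idea can be illustrated with a minimal sketch. Assumptions not taken from the paper: each generation is recorded as a trace of hidden-state vectors, abstract states come from nearest-centroid assignment against given centroids, the abstract model is a discrete-time Markov chain whose transition probabilities are estimated by counting, and the "semantics" bound to each abstract state is the mean trustworthiness score of the traces that visit it. LUNA supports several construction methods; `build_abstract_model` and `nearest_centroid` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]


def nearest_centroid(v: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the centroid closest to v (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(v, centroids[k])))


def build_abstract_model(
    traces: Sequence[Sequence[Vector]],
    centroids: Sequence[Vector],
    scores: Sequence[float],
) -> Tuple[Dict[int, Dict[int, float]], Dict[int, float]]:
    """Hypothetical helper: estimate a Markov chain over abstract states and
    bind a semantic score (mean score of visiting traces) to each state."""
    counts: Dict[int, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    score_sum: Dict[int, float] = defaultdict(float)
    visits: Dict[int, int] = defaultdict(int)
    for trace, score in zip(traces, scores):
        # Abstract each concrete hidden state to its nearest centroid.
        states = [nearest_centroid(v, centroids) for v in trace]
        for s in states:
            score_sum[s] += score
            visits[s] += 1
        # Count observed transitions between consecutive abstract states.
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += 1
    # Normalize each row of counts into transition probabilities.
    transitions: Dict[int, Dict[int, float]] = {}
    for src, row in counts.items():
        total = sum(row.values())
        transitions[src] = {dst: c / total for dst, c in row.items()}
    semantics = {s: score_sum[s] / visits[s] for s in visits}
    return transitions, semantics


# Usage with toy 2-D "hidden states" and per-trace scores (1.0 = trustworthy).
centroids = [(0.0, 0.0), (1.0, 1.0)]
traces = [[(0.0, 0.1), (0.9, 1.0), (1.0, 1.0)], [(0.1, 0.0), (0.0, 0.0)]]
transitions, semantics = build_abstract_model(traces, centroids, [1.0, 0.0])
print(transitions, semantics)
```

    Abstract states with low bound scores flag regions of the model's behavior associated with less trustworthy outputs; the paper's metrics for judging abstract-model quality are not reproduced here.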

    A Performance Analysis Model of TCP over Multiple Heterogeneous Paths for 5G Mobile Services

    Driven by the primary requirements of emerging 5G mobile services, the demand for concurrent multipath transfer (CMT) remains prominent. Yet multipath transport protocols are not widely adopted, and TCP-based CMT schemes will still dominate in 5G. However, the performance of a TCP flow transferred over multiple heterogeneous paths is sensitive to link-quality asymmetry, which our field investigation revealed to be significant. In this paper, we present a performance analysis model for TCP over multiple heterogeneous paths in 5G scenarios that takes both bandwidth and delay asymmetry into consideration. An evaluation using parameters from the field investigation shows that the proposed model achieves high accuracy in practical environments. Several interesting inferences can be drawn from the model, such as the dominant factor affecting the performance of TCP over heterogeneous networks and the criteria for determining the appropriate number of links to use under different degrees of path heterogeneity. The proposed model can thus guide the design of TCP-based CMT solutions for 5G mobile services.

    StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding

    Analogy-making between narratives is crucial for human reasoning. In this paper, we evaluate models' ability to identify and generate analogies by constructing a first-of-its-kind large-scale story-level analogy corpus, StoryAnalogy, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. We design a set of tests on StoryAnalogy, presenting the first evaluation of story-level analogy identification and generation. Interestingly, we find that the analogy identification tasks are incredibly difficult not only for sentence embedding models but also for recent large language models (LLMs) such as ChatGPT and LLaMA. ChatGPT, for example, achieved only around 30% accuracy on multiple-choice questions (compared to over 85% accuracy for humans). Furthermore, we observe that the data in StoryAnalogy can improve the quality of analogy generation in LLMs: a fine-tuned FlanT5-xxl model achieves performance comparable to zero-shot ChatGPT. Comment: Accepted by EMNLP 2023 main conference

    Efficacy of apatinib 250 mg combined with chemotherapy in patients with pretreated advanced breast cancer in a real-world setting

    Objectives: This study evaluated the efficacy and safety of apatinib (an oral small-molecule tyrosine kinase inhibitor targeting VEGFR-2) 250 mg combined with chemotherapy in patients with pretreated metastatic breast cancer in a real-world setting.
    Patients and methods: A database of patients with advanced breast cancer who received apatinib between December 2016 and December 2019 at our institution was reviewed, and patients who received apatinib combined with chemotherapy were included. Progression-free survival (PFS), overall survival (OS), objective response rate (ORR), disease control rate (DCR), and treatment-related toxicity were analyzed.
    Results: In total, 52 evaluated patients with metastatic breast cancer previously exposed to anthracyclines or taxanes who received apatinib 250 mg combined with chemotherapy were enrolled in this study. Median PFS and OS were 4.8 months (95% confidence interval [CI] = 3.2–6.4) and 15.4 months (95% CI = 9.2–21.6), respectively. The ORR and DCR were 25% and 86.5%, respectively. Median PFS for the previous line of treatment was 2.1 months (95% CI = 0.65–3.6), significantly shorter than that for the apatinib–chemotherapy combination (p < 0.001). No significant difference in ORR or PFS was identified among subgroups (subtypes, target lesion, combined regimens, and treatment lines). The common apatinib-related toxicities were hypertension, hand-foot syndrome, proteinuria, and fatigue.
    Conclusion: Apatinib 250 mg combined with chemotherapy provided favorable efficacy in patients with pretreated metastatic breast cancer regardless of molecular type and treatment line. The toxicities of the regimen were tolerable and manageable. This regimen could be a potential treatment option for patients with refractory, pretreated metastatic breast cancer.

    Finishing the euchromatic sequence of the human genome

    The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers ∼99% of the euchromatic genome and is accurate to an error rate of ∼1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome, including studies of gene number, birth, and death. Notably, the human genome seems to encode only 20,000–25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.

    Bayer Image Demosaicking Using Eight-Directional Weights Based on the Gradient of Color Difference

    In this paper, we propose a new demosaicking algorithm for Bayer images that uses eight-directional weights based on the gradient of color difference (EWGCD). To interpolate the green (G) pixels, the eight-directional G pixel values are first estimated at red (R)/blue (B) pixels. These estimates are used to calculate the color difference at the R/B pixels of the Bayer image in the diagonal directions. In the horizontal and vertical directions, by contrast, the newly estimated G pixels are used to obtain the color difference. The eight-directional weights of the estimated G pixels are obtained by considering both the gradient of the color difference and the gradient of the RGB pixels of the Bayer image. Finally, the eight-directional weighted values and the first estimated G pixel values are combined to obtain the full G image. Compared with six similar algorithms on the same eighteen McMaster images, the experimental results show that the proposed algorithm performs better not only in subjective visual quality but also in terms of peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index.
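
    The eight-directional green interpolation described above can be illustrated with a simplified sketch. It is not the EWGCD algorithm itself: the RGGB layout, the one-sided directional estimates, and the inverse-gradient weights 1 / (eps + |X(c) - X(c + 2d)|) are assumptions chosen for illustration, and the paper's weights additionally use the gradient of the color difference and a final combination with the first G estimate, neither of which is reproduced here. `green_at_rb` and `make_mosaic` are hypothetical names.

```python
from typing import List

Image = List[List[float]]

# The eight neighbor directions as (dy, dx): four axial, then four diagonal.
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1),
              (-1, -1), (-1, 1), (1, -1), (1, 1)]


def green_at_rb(raw: Image, i: int, j: int, eps: float = 1.0) -> float:
    """Weighted G estimate at an R or B site (i, j) of an RGGB Bayer mosaic.

    Simplified illustration: one G estimate per direction, weighted by the
    inverse of the same-channel gradient along that direction.
    """
    h, w = len(raw), len(raw[0])
    if not (2 <= i < h - 2 and 2 <= j < w - 2):
        raise ValueError("(i, j) must be at least 2 pixels from the border")
    center = raw[i][j]
    num = den = 0.0
    for dy, dx in DIRECTIONS:
        # Two steps away in any of the 8 directions is the same color as center.
        same = raw[i + 2 * dy][j + 2 * dx]
        if dy == 0 or dx == 0:
            # Axial: the adjacent pixel is G; correct with half the color step.
            g_est = raw[i + dy][j + dx] + (center - same) / 2
        else:
            # Diagonal: average the two G pixels flanking the diagonal; their
            # midpoint lies a quarter of the way to `same`, hence the /4.
            g_est = (raw[i + dy][j] + raw[i][j + dx]) / 2 + (center - same) / 4
        weight = 1.0 / (eps + abs(center - same))
        num += weight * g_est
        den += weight
    return num / den


def make_mosaic(r: float, g: float, b: float, n: int) -> Image:
    """Hypothetical test helper: flat RGGB mosaic (R at even/even, B at odd/odd)."""
    return [[r if i % 2 == 0 and j % 2 == 0 else b if i % 2 == 1 and j % 2 == 1 else g
             for j in range(n)] for i in range(n)]


raw = make_mosaic(80.0, 50.0, 30.0, 7)
print(green_at_rb(raw, 2, 2))  # R site; flat G plane gives 50.0
```

    Directions that cross an edge have a large same-channel gradient and therefore a small weight, which is the intuition behind gradient-based directional weighting.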